Data Science Tools: Numpy and Pandas

Both Numpy and Pandas are common tools used by Data Scientist.

  1. Numpy
    1. Multidimensional arrays and matrices
    2. Mathematical functions
  2. Pandas
    1. Structure and manipulate data in ways well suited for data analysis

Numpy

Numpy is a library that is especially useful for when you want to work with large arrays and matrices of numeric data. Numpy is optimized to run fast, much faster than working with Python lists directly.

At the foundation of Numpy is the array object class. Numpy arrays are like python lists except that every element in the array has to be of the same type.


In [2]:
import numpy as np
array = np.array([1,2,3,4,5],float)

In [3]:
import numpy as np
array = np.array([1,2,3,4,5],float)
print array
matrix = np.array([[1,2,3,4],[5,6,7,8]],float)
print matrix


[ 1.  2.  3.  4.  5.]
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]

You can index, slice, and manipulate a numpy array much like you do Python lists


In [4]:
print array[1]
print array[:2]
print array[3:]


2.0
[ 1.  2.]
[ 4.  5.]

In [5]:
print matrix[1][1]
print matrix[1,:]
print matrix[:,1]


6.0
[ 5.  6.  7.  8.]
[ 2.  6.]

You can perform arithmetic operations on Numpy arrays


In [6]:
array1 = np.array([1,2],float)
array2 = np.array([3,4],float)

print array1 + array2
print array1 - array2
print array1 * array2
print array1 / array2


[ 4.  6.]
[-2. -2.]
[ 3.  8.]
[ 0.33333333  0.5       ]

In [7]:
matrix1 = np.array([[1,2],[3,4]],float)
matrix2 = np.array([[5,6],[7,8]],float)

print matrix1 + matrix2
print matrix1 - matrix2
print matrix1 * matrix2
print matrix1 / matrix2


[[  6.   8.]
 [ 10.  12.]]
[[-4. -4.]
 [-4. -4.]]
[[  5.  12.]
 [ 21.  32.]]
[[ 0.2         0.33333333]
 [ 0.42857143  0.5       ]]

In addition to arithmetic operations, Numpy has a number of other mathematical operations you can apply (e.g., mean, median, standard deviation, dot product)


In [8]:
# Functions useful for statistical analysis
array1 = np.arange(1,6)

print array1
print np.mean(array1)
print np.median(array1)
print np.std(array1)


[1 2 3 4 5]
3.0
3.0
1.41421356237
Refresher Dot product formula: a = [a1,a2,a3] b = [b1,b2,b3] dot(a,b) = a1xb1 + a2xb2 + a3xb3 For more info on Numpy.dot go to: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html

In [9]:
# Dot product
array2 = np.arange(3,8)
print array1
print array2

print(np.dot(array1,array2))


[1 2 3 4 5]
[3 4 5 6 7]
85

In [10]:
# Dot product (matrix)
oneDArray= np.array([1,2])
twoDArray = np.array([[2,4,6],[3,5,7]])

print np.dot(oneDArray,twoDArray)


[ 8 14 20]

Pandas

Data can be stored and referenced using Pandas.

Data in Pandas is often stored in a structure called the dataframe. A dataframe is a 2D labled data structure with columns that can be different types.

Dataframe


In [11]:
from pandas import DataFrame, Series

In [12]:
# Create a dictionary where the key = name of columns and the value is a corresponding series
# Value -> 1st pass the data you want in dataframe and index where you want the data to go
#
# You can think about a Series as a one-dimensional object that is similar to an array, 
# list, or column in a database.  By default it will assign an index label to each to each 
# item in the Series ranging from zero to N-1, where N is the number of items in the series.
#
data = {'name':Series(['Braund','Cummings','Heikkinen','Allen'],index = ['a','b','c','d']),
        'age':Series([22,38,26,35],index = ['a','b','c','d']),
        'Fare':Series([7.25,71.83,8.05],index = ['a','b','d']),
        'Survived':Series([False,True,True,False],index = ['a','b','c','d'])}

In [13]:
# Pass data as a argument to the data frame function to create the actual data frame
df = DataFrame(data)
print df


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]

Accesing Data Column Data


In [14]:
# You can operate on specific columns by calling them as if they were a key in a dictionary.
# You can access one column.  When you do, you get a Series
print df['name']


a       Braund
b     Cummings
c    Heikkinen
d        Allen
Name: name, dtype: object

In [15]:
# You can access multiple columns by passing list of column names.  
# When you do, you get back a dataframe.
df[['name','age','Survived']]


Out[15]:
name age Survived
a Braund 22 False
b Cummings 38 True
c Heikkinen 26 True
d Allen 35 False

4 rows × 3 columns

Subseting dataframe using true/false statements on columns

In [16]:
print df

# Select rows in which the passanger age >= 30
print df[df['age']>= 30]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings
d   8.05    False   35     Allen

[2 rows x 4 columns]

In [17]:
# You can also perform the above operation on particular columns
# Example: Get 'Survived' information for passagengers whose age >= 30
# Notes: 
# df['Survived'] only picks out data from the 'Survived' column 
# df['Survived'][df['age']>= 30] picks indices where this statment 'df[age] > 30' is true.
print df
print df['Survived'][df['age'] >= 30]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
b     True
d    False
Name: Survived, dtype: bool

Pandas Row Indexing

You can access rows through multiple ways

  1. An individual index (through the functions loc and iloc)
  2. Slicing
  3. Boolean indexing

You could also combine multiple selection requirements through boolean operators like & (and) and | (or)

Index (loc,iloc)

In [18]:
print df

# Get row corresponding to passenger "Braund", whose index is a
print df.loc['a']
print ""

# You can also access via integer position
print df.iloc[[1]]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
Fare          7.25
Survived     False
age             22
name        Braund
Name: a, dtype: object

    Fare Survived  age      name
b  71.83     True   38  Cummings

[1 rows x 4 columns]
Boolean Indexing

In [19]:
print df

# Example: find the passengers with age >= 30
print df[df['age']>= 30]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings
d   8.05    False   35     Allen

[2 rows x 4 columns]
Slicing

In [20]:
print df

print df[0:2]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
a   7.25    False   22    Braund
b  71.83     True   38  Cummings

[2 rows x 4 columns]
Boolean Operations

In [21]:
print df

# Multiple selection using & (and)
print df[(df.Survived == True) & (df.age > 30)]


    Fare Survived  age       name
a   7.25    False   22     Braund
b  71.83     True   38   Cummings
c    NaN     True   26  Heikkinen
d   8.05    False   35      Allen

[4 rows x 4 columns]
    Fare Survived  age      name
b  71.83     True   38  Cummings

[1 rows x 4 columns]

It is possible to perform boolean indexing on specific columns

Pandas Descriptive Functions

Pandas also has various functions that help you understand some basic information about your dataframe. Some of this functions are:

  1. dtypes: to get the data type for each column
  2. describe: useful for seeing basic statistics of numeric columns in your dataframe
  3. head: displays the first 5 rows in your dataset
  4. tail: displays the last 5 rows in your dataset

In [22]:
print df.dtypes
print ""
print df.describe()


Fare        float64
Survived       bool
age           int64
name         object
dtype: object

            Fare   Survived    age
count   3.000000          4   4.00
mean   29.043333        0.5  30.25
std    37.056499  0.5773503   7.50
min     7.250000      False  22.00
25%     7.650000          0  25.00
50%     8.050000        0.5  30.50
75%    39.940000          1  35.75
max    71.830000       True  38.00

[8 rows x 3 columns]
Pandas Vectorization

What does it mean to operate on a dataframe in a vecotired way?


In [23]:
# First, lets create a dataframe
d = {'one': Series([1,2,3], index = ['a','b','c']),
     'two': Series([1,2,3,4], index = ['a','b','c','d'])}

df = DataFrame(d)

print df


   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

[4 rows x 2 columns]

In [24]:
# Second, we apply an arbitrary function to all the columns
# in the dataframe using pf.apply.
# The result itself is a new dataframe

# Example: apply numpy.mean to every column (axis = 0 for column operation, axis = 1 for row operation)

df.apply(np.mean, axis = 0)


Out[24]:
one    2.0
two    2.5
dtype: float64
map

In [25]:
# You can apply map to a column in the dataframe or the entire dataframe.
# This functions will allow you to apply functions that take in a single value and returns
# a single value

# Example
print df['one']

#Go through every single value in 'One' and applies lambda function
print df['one'].map(lambda x: x> 1) 

# Refresher:
# Lambda functions are small inline functions that are 
# defined on-the-fly in Python. lambda x: x>= 1 will take an input x and return x>=1, 
# or a boolean that equals True or False.
# For more info go to: https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions


a     1
b     2
c     3
d   NaN
Name: one, dtype: float64
a    False
b     True
c     True
d    False
Name: one, dtype: bool

In [26]:
# The function can be applied to all the columns of the dataframe
print df
print df.applymap(lambda x: x>1)


   one  two
a    1    1
b    2    2
c    3    3
d  NaN    4

[4 rows x 2 columns]
     one    two
a  False  False
b   True   True
c   True   True
d  False   True

[4 rows x 2 columns]

Note: map() can only be used on a Series to return a new Series and applymap() can only be used on a DataFrame to return a new DataFrame.

Exercise #1


In [27]:
'''
Compute the average of bronze medals earned by countries who earned at least one gold medal.
Save this to a variable called: avg_bronze_at_least_one_gold

HINT-1:
    You can retrieve all of the values of a Pandas column from a 
    data frame, "df", as follows:
    df['column_name']
    
    HINT-2:
    The numpy.mean function can accept as an argument a single
    Pandas column. 
    
    For example, numpy.mean(df["col_name"]) would return the 
    mean of the values located in "col_name" of a dataframe df.
'''
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
    
olympic_medal_counts = {'country_name':Series(countries),
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}

df = DataFrame(olympic_medal_counts)

np.mean(df['bronze'][df['gold']>0])
np.mean(df[df['gold']>0]['bronze'])


Out[27]:
4.2380952380952381

Exercise #2

Using the dataframe's apply method, create a new Series called avg_medal_count that indicates the average number of gold, silver, and bronze medals earned amongst countries who earned at least one medal of any kind at the 2014 Sochi olympics.


In [28]:
# Original data frame
print df


    bronze    country_name  gold  silver
0        9    Russian Fed.    13      11
1       10          Norway    11       5
2        5          Canada    10      10
3       12   United States     9       7
4        9     Netherlands     8       7
5        5         Germany     8       6
6        2     Switzerland     6       3
7        1         Belarus     5       0
8        5         Austria     4       8
9        7          France     4       4
10       1          Poland     4       1
11       2           China     3       4
12       2           Korea     3       3
13       6          Sweden     2       7
14       2  Czech Republic     2       4
15       4        Slovenia     2       2
16       3           Japan     1       4
17       1         Finland     1       3
18       2   Great Britain     1       1
19       1         Ukraine     1       0
20       0        Slovakia     1       0
21       6           Italy     0       2
22       2          Latvia     0       2
23       1       Australia     0       2
24       0         Croatia     0       1
25       1      Kazakhstan     0       0

[26 rows x 4 columns]

In [29]:
# First:  Get subset of dataframe that only contains medal information
df1 = df[['gold','silver','bronze']]
print df1


    gold  silver  bronze
0     13      11       9
1     11       5      10
2     10      10       5
3      9       7      12
4      8       7       9
5      8       6       5
6      6       3       2
7      5       0       1
8      4       8       5
9      4       4       7
10     4       1       1
11     3       4       2
12     3       3       2
13     2       7       6
14     2       4       2
15     2       2       4
16     1       4       3
17     1       3       1
18     1       1       2
19     1       0       1
20     1       0       0
21     0       2       6
22     0       2       2
23     0       2       1
24     0       1       0
25     0       0       1

[26 rows x 3 columns]

In [30]:
# Second:  Keep rows in which there is at least 1 medal 
df2 = df1[(df1['gold']>0) | (df1['silver']>0) | (df1['bronze']>0)]  
print df2


    gold  silver  bronze
0     13      11       9
1     11       5      10
2     10      10       5
3      9       7      12
4      8       7       9
5      8       6       5
6      6       3       2
7      5       0       1
8      4       8       5
9      4       4       7
10     4       1       1
11     3       4       2
12     3       3       2
13     2       7       6
14     2       4       2
15     2       2       4
16     1       4       3
17     1       3       1
18     1       1       2
19     1       0       1
20     1       0       0
21     0       2       6
22     0       2       2
23     0       2       1
24     0       1       0
25     0       0       1

[26 rows x 3 columns]

In [31]:
# Third: Get average number of gold, silver, and bronze  
avg_medal_count = Series(df2.apply(np.mean))

print avg_medal_count


gold      3.807692
silver    3.730769
bronze    3.807692
dtype: float64

Exercise #3

Imagine a point system in which each country is awarded 4 points for each gold medal, 2 points for each silver medal, and 1 point for each bronze medal.

Using the numpy.dot function, create a new dataframe that includes: a) a column called 'country_name' with the country name b) a column called 'points' with the total number of points the country earned at the Sochi olympics


In [33]:
# Create an array with point system
awards = np.array([4,2,1])

data = {'country_name':df['country_name'],
        'points':Series(df[['gold','silver','bronze']].apply(lambda x:np.dot(awards,x),axis = 1))}

newDF = DataFrame(data)
print newDF


      country_name  points
0     Russian Fed.      83
1           Norway      64
2           Canada      65
3    United States      62
4      Netherlands      55
5          Germany      49
6      Switzerland      32
7          Belarus      21
8          Austria      37
9           France      31
10          Poland      19
11           China      22
12           Korea      20
13          Sweden      28
14  Czech Republic      18
15        Slovenia      16
16           Japan      15
17         Finland      11
18   Great Britain       8
19         Ukraine       5
20        Slovakia       4
21           Italy      10
22          Latvia       6
23       Australia       5
24         Croatia       2
25      Kazakhstan       1

[26 rows x 2 columns]

In [37]:
dp_function = lambda x:np.dot(awards,x);

data = {'country_name':df['country_name'],
        'points':Series(df[['gold','silver','bronze']].apply(dp_function,axis = 1))}

newDF = DataFrame(data)
print newDF


      country_name  points
0     Russian Fed.      83
1           Norway      64
2           Canada      65
3    United States      62
4      Netherlands      55
5          Germany      49
6      Switzerland      32
7          Belarus      21
8          Austria      37
9           France      31
10          Poland      19
11           China      22
12           Korea      20
13          Sweden      28
14  Czech Republic      18
15        Slovenia      16
16           Japan      15
17         Finland      11
18   Great Britain       8
19         Ukraine       5
20        Slovakia       4
21           Italy      10
22          Latvia       6
23       Australia       5
24         Croatia       2
25      Kazakhstan       1

[26 rows x 2 columns]

In [32]:

References

  1. Numpy Reference Guide
    1. http://docs.scipy.org/doc/numpy/reference/
  2. Pandas Documentation
    1. http://pandas.pydata.org/pandas-docs/stable/
  3. Pandas tutorial using Ipython notebooks
    1. https://bitbucket.org/hrojas/learn-pandas